Abstract

Sampling from a random discrete distribution induced by a 'stick-breaking' process is considered. Under a moment condition, it is shown that the asymptotics of the sequence of occupancy numbers, and of the small-parts counts (singletons, doubletons, etc.), can be read off from a limiting model involving a unit Poisson point process and a self-similar renewal process on the halfline.

1 Introduction 


A multiplicative renewal process (also known as residual allocation model or stick-breaking) is a random sequence $\mathcal B = (P_j : j = 0, 1, \ldots)$ of the form

$$P_j = \prod_{i=1}^{j} W_i, \qquad (1.1)$$

(so $P_0 = 1$) where $(W_i : i = 1, 2, \ldots)$ are independent copies of a random variable $W$ taking values in $]0, 1[$. We shall assume that the support of the distribution of $W$ is not a geometric sequence or, equivalently, that the distribution of the variable $|\log W|$ is non-lattice, and also assume that

$$\mu := \mathbb E[-\log W] < \infty. \qquad (1.2)$$

The 'stick-breaking' set $\mathcal B$ will be viewed as a simple point process, with $0$ being the only accumulation point. The complement $\mathcal B^c = [0, 1] \setminus \mathcal B$ is an open set comprised of the component intervals $]P_{j+1}, P_j[$ for $j = 0, 1, \ldots$.

Let $U_1, U_2, \ldots$ be independent uniform $[0, 1]$ random points, also independent of $\mathcal B$, and for each $n$ let $U_{n,1} < \ldots < U_{n,n}$ be the order statistics of $U_1, \ldots, U_n$. These data define a random occupancy scheme, in which a collection of $n$ 'balls' $U_1, \ldots, U_n$ is sequentially sorted into 'boxes' $]P_j, P_{j-1}[$, $j = 1, 2, \ldots$. In the most studied and analytically most tractable case the law of $W$ is beta$(\theta, 1)$, and the allocation of 'balls-in-boxes' belongs to the circle of questions around the Ewens sampling formula [2H]. Let $K_n$ be the number of occupied 'boxes' and $K_{n,r}$ be the number of 'boxes' occupied by exactly $r$ 'balls', so $\sum_{r>0} K_{n,r} = K_n$ and $\sum_{r>0} r K_{n,r} = n$. We also define $K_{n,0}$ to be the number of unoccupied interval components of $\mathcal B^c \cap [U_{n,1}, 1]$, so that $K_n = I_n - K_{n,0}$, with $I_n := \min\{i : P_i < U_{n,1}\}$ being the index of the leftmost occupied interval.
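The definitions above can be illustrated with a small simulation. This is a minimal sketch, assuming the beta$(\theta, 1)$ case mentioned in the text; the function name `occupancy` and its parameters are hypothetical and serve only to make the quantities $K_n$, $K_{n,r}$ and $I_n$ concrete.

```python
import random
from collections import Counter


def occupancy(n, theta=1.0, seed=0):
    """Simulate the occupancy scheme for W ~ beta(theta, 1) (illustrative assumption).

    Returns (K_n, counts, I_n), where counts[r] = K_{n,r} for r >= 0.
    """
    rng = random.Random(seed)
    # Balls: n independent uniform points in ]0, 1].
    balls = [1.0 - rng.random() for _ in range(n)]
    u_min = min(balls)

    # Break the stick until the first P_i falls below the smallest ball;
    # that index i is I_n.
    points = [1.0]  # P_0 = 1
    while points[-1] >= u_min:
        w = (1.0 - rng.random()) ** (1.0 / theta)  # inverse-CDF draw of beta(theta, 1)
        points.append(points[-1] * w)
    i_n = len(points) - 1

    # Box j is ]P_j, P_{j-1}[ for j = 1, ..., I_n.
    per_box = [0] * (i_n + 1)  # index 0 is unused
    for u in balls:
        j = 1
        while points[j] >= u:
            j += 1
        per_box[j] += 1

    counts = Counter(per_box[1:])  # counts[0] is K_{n,0}
    k_n = sum(1 for c in per_box[1:] if c > 0)
    return k_n, counts, i_n
```

For any run, the three identities stated in the text ($\sum_{r>0} K_{n,r} = K_n$, $\sum_{r>0} r K_{n,r} = n$ and $K_n = I_n - K_{n,0}$) should hold exactly.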

In [5] [B] renewal theory was applied to explore the spectrum of possible limit laws for $K_n$, including normal, stable and Mittag-Leffler distributions. In the present note we focus on the variables $K_{n,r}$. We approach the $K_{n,r}$'s via the occupancy counts

$$Z_n^{(i)} := \#\{1 \le j \le n : U_j \in\, ]P_{I_n-i+1}, P_{I_n-i}[\,\}, \qquad i = 1, 2, \ldots$$

in the left-to-right order of intervals, where we adopt the convention $Z_n^{(i)} = 0$ for $i > I_n$. Extending a result from [5] about $Z_n^{(1)}$, we will show that the $Z_n^{(i)}$'s jointly converge to the sequence of occupancy numbers in a limiting model that involves a Poisson process and another self-similar point process on the halfline.

From a broader viewpoint, $\mathcal B$ is an exponential transform of the range of a subordinator with finite Lévy measure, that is, of a compound Poisson process. Asymptotics of $K_n$ and the $K_{n,r}$'s have been studied in a similar occupancy model for subordinators with infinite Lévy measures [2] [5] [S]. In the infinite measure case neither the counts $Z_n^{(i)}$ nor $I_n$ can be defined, because $\mathcal B$ is then a random Cantor set, hence there are infinitely many unoccupied intervals between any two occupied components of $\mathcal B^c$.

2 Occupancy counts 

For $0 \le m \le n$ the probability that the interval $]P_1, P_0[$ contains $m$ out of $n$ uniform points is

$$p(n : m) = \binom{n}{m}\, \mathbb E\left[W^{n-m}(1 - W)^m\right].$$
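As a concrete instance of $p(n:m)$, the beta$(\theta, 1)$ case admits a closed form through the beta integral, $\mathbb E[W^a(1-W)^b] = \theta\,\Gamma(a+\theta)\Gamma(b+1)/\Gamma(a+b+\theta+1)$. The sketch below assumes this law for $W$; the function name `p` mirrors the notation of the text and is otherwise hypothetical.

```python
from math import comb, exp, lgamma, log


def p(n, m, theta=1.0):
    """p(n:m) = C(n, m) * E[W^(n-m) (1-W)^m] for W ~ beta(theta, 1) (assumed law)."""
    a, b = n - m, m
    # log of E[W^a (1-W)^b] = theta * B(a + theta, b + 1), via log-gamma for stability
    log_moment = (log(theta) + lgamma(a + theta) + lgamma(b + 1)
                  - lgamma(a + b + theta + 1))
    return comb(n, m) * exp(log_moment)
```

Two sanity checks follow from the definition: the values $p(n:0), \ldots, p(n:n)$ sum to one by the binomial theorem, and for uniform $W$ (that is, $\theta = 1$) every $p(n:m)$ equals $1/(n+1)$.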

Let $(n_1, \ldots, n_k)$ be a weak composition of $n$, meaning that $n_1 \ge 0, n_2 \ge 0, \ldots, n_k \ge 0$ and $n_1 + \ldots + n_k = n$. The structure (1.1) and elementary properties of the uniform distribution imply the product formula for the probability that the intervals $]P_j, P_{j-1}[$, $j = 1, \ldots, k$, contain $n_k, n_{k-1}, \ldots, n_1$ uniform points, respectively,

$$\binom{n}{n_1, \ldots, n_k} \prod_{i=1}^{k} \mathbb E\left[W^{n_1 + \ldots + n_{i-1}}(1 - W)^{n_i}\right] = p(n_1 + \ldots + n_k : n_k)\, p(n_1 + \ldots + n_{k-1} : n_{k-1}) \cdots p(n_1 : n_1), \qquad (2.1)$$

where the multinomial coefficient has been factored as $\prod_{i=1}^{k} \binom{n_1 + \ldots + n_i}{n_i}$, one binomial factor for each $p$. While this formula readily gives the joint distribution of the occupancy counts read right-to-left, there is no simple formula for the joint distribution of the counts read left-to-right. We will see that in the $n \to \infty$ limit there is a considerable simplification, as in [7].

Observe that $Z_n := (Z_n^{(i)} : i = 1, 2, \ldots)$ can be defined in the same 'balls-in-boxes' fashion in terms of the inflated sets $n\mathcal B$ and $\mathcal U_n := \{nU_{n,j} : 1 \le j \le n\}$. From extreme-value theory we know that, as $n \to \infty$, the point process $\mathcal U_n$ converges vaguely to a unit Poisson process $\mathcal U$ on $\mathbb R_+$. Here and henceforth, vague convergence means weak convergence on every finite interval bounded away from $0$. On the other hand, $n\mathcal B$ also converges vaguely to some point process $\widehat{\mathcal B}$ on $\mathbb R_+$ which is self-similar, that is, satisfies $c\widehat{\mathcal B} \stackrel{d}{=} \widehat{\mathcal B}$ for every $c > 0$. The convergence of $n\mathcal B$ is a consequence of the classical renewal theorem applied to the finite-mean random walk $(-\log P_j : j \in \mathbb N_0)$. The self-similarity in this context is analogous to stationarity in (additive) renewal theory.

The set $\mathbb R_+ \setminus \widehat{\mathcal B}$ is itself a collection of open intervals ('boxes') which collect the points of $\mathcal U$ ('balls'), hence we can define a nonnegative sequence of counts of 'balls-in-boxes' $Z := (Z^{(i)} : i = 1, 2, \ldots)$ which starts with some positive number $Z^{(1)}$ of Poisson points falling in the leftmost nonempty interval. In view of the convergence of the point processes, one can expect that the convergence of the counting sequences also holds.

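Theorem 2.1. As $n \to \infty$,

$$(Z_n^{(1)}, Z_n^{(2)}, \ldots) \to_d (Z^{(1)}, Z^{(2)}, \ldots). \qquad (2.2)$$

The distribution of the limit sequence is given by the formula

$$\begin{aligned}
\mathbb P\{Z^{(1)} = n_1, \ldots, Z^{(\ell)} = n_\ell\}
&= \frac{1}{\mu(n_1 + \ldots + n_\ell)} \binom{n_1 + \ldots + n_\ell}{n_1, \ldots, n_\ell} \prod_{i=1}^{\ell} \mathbb E\left[W^{n_1 + \ldots + n_{i-1}}(1 - W)^{n_i}\right] \\
&= \frac{1}{\mu(n_1 + \ldots + n_\ell)}\, p(n_1 + \ldots + n_\ell : n_\ell)\, p(n_1 + \ldots + n_{\ell-1} : n_{\ell-1}) \cdots p(n_1 : n_1)
\end{aligned} \qquad (2.3)$$

for any $\ell > 0$, $n_1 > 0$ and $n_2, \ldots, n_\ell \ge 0$.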
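For $\ell = 1$ the formula (2.3) reduces to $\mathbb P\{Z^{(1)} = m\} = \mathbb E[(1-W)^m]/(\mu m)$, and it is instructive to check that this is a probability distribution. The sketch below assumes, purely for illustration, that $W$ is uniform on $]0,1[$, so that $\mu = 1$ and $\mathbb E[(1-W)^m] = 1/(m+1)$; the function name is hypothetical. Exact rational arithmetic is used, so the telescoping of $\sum_m 1/(m(m+1))$ is visible without rounding.

```python
from fractions import Fraction


def first_count_pmf(m):
    """P{Z^(1) = m} = E[(1-W)^m] / (mu * m), for W uniform on ]0,1[ (assumed), mu = 1."""
    moment = Fraction(1, m + 1)  # E[(1-W)^m] for uniform W
    return moment / m


# Partial sums telescope: sum_{m=1}^{M} 1/(m(m+1)) = M/(M+1), which tends to 1.
partial = sum(first_count_pmf(m) for m in range(1, 1001))
```

In this uniform case the leftmost nonempty box holds a single Poisson point with probability $1/2$.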

Proof. Fix $\epsilon > 0$ and restrict all point processes to $[\epsilon, \epsilon^{-1}]$. By Skorohod's theorem we can select a probability space in such a way that the convergence of $(n\mathcal B, \mathcal U_n)$ to $(\widehat{\mathcal B}, \mathcal U)$ holds almost surely; then, for continuity reasons (see Lemma 3.1 to follow), the occupancy numbers of the intervals within $[\epsilon, \epsilon^{-1}]$ converge. The weak convergence (2.2) follows by sending $\epsilon \to 0$ and noting that the probability that any $m$ leftmost points of $\mathcal U$ fit in $[\epsilon, \epsilon^{-1}]$ goes to one.

Let $n = n_1 + \cdots + n_\ell$ and denote by $X$ the $(n+1)$st leftmost point of $\mathcal U$. The generic sequence of occupancy numbers which gives rise to the event in (2.3) is of the form $(n_1, \ldots, n_\ell, 0, \ldots, 0, m)$ where $m$ is some positive number and the number of $0$'s is arbitrary. Let $G = \max(\widehat{\mathcal B} \cap [0, X])$ be the largest point of $\widehat{\mathcal B}$ smaller than $X$; from self-similarity and [7] we know that the distribution of $G/X$ has density $(\mu x)^{-1}\, \mathbb P\{W < x\}$ on $[0, 1]$, and from the order statistics property of the Poisson process we know that given $X$ the first $n$ points of $\mathcal U$ are distributed as a uniform sample from $[0, X]$. The pattern $(n_1, \ldots, n_\ell, 0, \ldots, 0, m)$ occurs when the uniform $n$-sample does not hit $[G, X]$ (event $E_1$) and within $[0, G]$ the occupancy numbers are $(n_1, \ldots, n_\ell, 0, \ldots, 0)$ (event $E_2$). Integrating by parts, the probability of $E_1$ is

$$\int_0^1 x^n\, \frac{\mathbb P\{W < x\}}{\mu x}\, dx = \frac{1}{\mu n}\left(1 - \mathbb E[W^n]\right).$$

For $i = 0, 1, \ldots$ let $E_{2,i}$ be the event that the pattern $(n_1, \ldots, n_\ell, 0, \ldots, 0)$ with exactly $i$ zeroes occurs. In view of the equality $(G^{-1}\widehat{\mathcal B}) \cap [0, 1[\ \stackrel{d}{=}\ \mathcal B \setminus \{1\}$, the conditional probability $\mathbb P\{E_{2,i} \mid X = x, E_1\}$ equals the probability (2.1) with $k = \ell + i$ and $n_{\ell+1} = \cdots = n_k = 0$. Each of the $i$ empty boxes contributes a factor $p(n : 0) = \mathbb E[W^n]$. Since $E_2 = \bigcup_{i \ge 0} E_{2,i}$, summing the last probabilities over $i$ gives a geometric series, and we have

$$\mathbb P\{E_2 \mid X = x, E_1\} = \frac{1}{1 - \mathbb E[W^n]}\, p(n_1 + \ldots + n_\ell : n_\ell)\, p(n_1 + \ldots + n_{\ell-1} : n_{\ell-1}) \cdots p(n_1 : n_1) = \mathbb P\{E_2 \mid E_1\}.$$

Since the probability in (2.3) equals $\mathbb P\{E_1 \cap E_2\}$, the proof is complete. □
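The integration by parts for $\mathbb P\{E_1\}$ in the proof can be checked numerically. The sketch below assumes $W \sim$ beta$(\theta, 1)$, for which $\mathbb P\{W < x\} = x^\theta$, $\mu = 1/\theta$ and $\mathbb E[W^n] = \theta/(n+\theta)$; the function names and the midpoint rule are illustrative choices, not part of the original argument.

```python
def prob_e1_integral(n, theta, steps=100_000):
    """Midpoint-rule value of int_0^1 x^n P{W < x} / (mu x) dx for W ~ beta(theta, 1)."""
    mu = 1.0 / theta
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x ** n * x ** theta / (mu * x)  # P{W < x} = x^theta
    return total * h


def prob_e1_closed(n, theta):
    """Right-hand side (1 - E[W^n]) / (mu n) for W ~ beta(theta, 1)."""
    mu = 1.0 / theta
    moment = theta / (n + theta)  # E[W^n]
    return (1.0 - moment) / (mu * n)
```

Under this assumption both sides equal $\theta/(n+\theta)$, so for example `prob_e1_integral(3, 2.0)` and `prob_e1_closed(3, 2.0)` should agree up to quadrature error.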

3 r-counts

We wish to connect the asymptotics of $r$-counts to Theorem 2.1. Let $Y$ be the leftmost atom of $\mathcal U$. For $r \ge 0$ let $K_r^*$ be the number of intervals of $]Y, \infty[\ \setminus \widehat{\mathcal B}$ that contain exactly $r$ points of $\mathcal U$. For $r > 0$ we can take $]0, \infty[$ instead of $]Y, \infty[$ in this definition.

Lemma 3.1. Let $A$, $B$ be two simple (i.e. without multiple points) point processes defined and a.s. finite in some interval $[s, t]$, and such that $A \cap B = \emptyset$ a.s. Suppose we have weak convergence $(A_n, B_n) \to_d (A, B)$ for a sequence of bivariate point processes. Define a gap to be a subinterval of $[s, t]$ whose endpoints are consecutive atoms of $B$. Let $L_k$ be the number of gaps in $B$ that contain exactly $k$ points of $A$ (with the convention that $L_0$ counts the gaps to the right of the leftmost $A$-point in $[s, t]$), and let $L_{n,k}$ be defined similarly in terms of $(A_n, B_n)$. Then $(L_{n,0}, L_{n,1}, \ldots) \to_d (L_0, L_1, \ldots)$ as $n \to \infty$.

Proof. By Skorohod's theorem a version of the processes can be defined on some probability space in such a way that with probability one the convergence is pointwise. That is to say, for large enough $n$, $\#B_n$ and $\#B$ are equal and the points of $B_n$ (labelled, e.g., in increasing order) are $\epsilon$-close to the points of $B$. The same holds for $A_n$ and $A$. Thus for large $n$, there is a bijection between the gaps in $B$ and in $B_n$, and between the points of $A$ and $A_n$ that fall in each particular gap. □
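The gap counts $L_k$ of Lemma 3.1 are easy to compute for a finite configuration. The following is a minimal sketch; the function name `gap_counts` and the list-based representation of the point processes are assumptions made for illustration.

```python
from collections import Counter


def gap_counts(a_points, b_points):
    """Return a Counter mapping k to L_k for finite point sets A and B.

    A gap is an interval between consecutive atoms of B. Empty gaps are
    counted in L_0 only if they lie to the right of the leftmost A-point.
    """
    a = sorted(a_points)
    b = sorted(b_points)
    counts = Counter()
    if not a:
        return counts  # no A-points: no gap is counted under the convention
    leftmost = a[0]
    for left, right in zip(b, b[1:]):
        k = sum(1 for x in a if left < x < right)
        if k > 0 or left > leftmost:
            counts[k] += 1
    return counts
```

For example, with $A = \{2.5, 2.7, 6.1\}$ and $B = \{1, 2, 3, 4, 5, 7\}$ the gap $]1, 2[$ is empty but lies to the left of the leftmost $A$-point, so it is not counted; the remaining gaps give $L_2 = 1$, $L_0 = 2$ and $L_1 = 1$.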



A variation of the lemma allows accumulation of atoms of the gaps-generating process at the left endpoint of the underlying interval. In our situation both $\widehat{\mathcal B}$ and $\mathcal U$ live on the halfline and accumulate at infinity, hence to pass from the occupancy counts to the $K_{n,r}$'s we need to take further care by showing that the contribution of the counts within $[s, \infty]$ is negligible for large $s$. To this end, it is enough to work with expected values.

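Now, the mean contribution of $[0, s]$ to $\mathbb E[K_0^*]$ can be estimated by the expected number of points in $\widehat{\mathcal B} \cap [\min(Y, s), s]$, where $Y$ is a unit exponential variable:

$$\mathbb E\left[\int_{\min(Y,s)}^{s} \frac{dx}{\mu x}\right] = \int_0^\infty e^{-z}\, dz \int_{\min(z,s)}^{s} \frac{dx}{\mu x} = \int_0^s e^{-z}\, dz \int_z^s \frac{dx}{\mu x} < \infty.$$

Lemma 3.2. We have $\mathbb E[K_r^*] = (\mu r)^{-1}$ for $r > 0$, and also $\mathbb E[K_0^*] = \nu/\mu$, where

$$\nu := \mathbb E[-\log(1 - W)]$$

may be finite or infinite.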

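Proof. Indeed, by renewal theory the intensity measure of the process $\widehat{\mathcal B}$ is $(\mu x)^{-1} dx$. Understanding a possible $\widehat{\mathcal B}$-atom in $dx$ as the right endpoint of a gap, we obtain for $r > 0$ that the mean number of gaps with right endpoint in $[s, \infty[$ which contain exactly $r$ points of $\mathcal U$ is

$$\mathbb E \int_s^\infty e^{-(1-W)x}\, \frac{x^r (1 - W)^r}{r!}\, \frac{dx}{\mu x}.$$

For $r > 0$ setting $s = 0$ we obtain $\mathbb E[K_r^*] = (\mu r)^{-1}$, since the substitution $y = (1 - W)x$ gives

$$\mathbb E[K_r^*] = \frac{1}{\mu r!} \int_0^\infty e^{-y} y^{r-1}\, dy = \frac{1}{\mu r}.$$

For $r = 0$ we have

$$\mathbb E[K_0^*] = \mathbb E \int_0^\infty e^{-(1-W)x} \left(1 - e^{-Wx}\right) \frac{dx}{\mu x} = \frac{\mathbb E[-\log(1 - W)]}{\mu} = \frac{\nu}{\mu},$$

where the second factor in the integrand stands for the event that $Y$ is smaller than the left endpoint of the gap. □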
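Both computations in the proof rest on standard integrals: the gamma integral $\int_0^\infty e^{-y} y^{r-1} dy = (r-1)!$ and the Frullani-type identity $\int_0^\infty e^{-(1-w)x}(1 - e^{-wx})\, dx/x = -\log(1-w)$ for fixed $w \in\, ]0,1[$. The sketch below checks them numerically; the function names, the truncation point and Simpson's rule are illustrative assumptions.

```python
from math import exp, factorial, gamma, log


def mean_r_count(r, mu):
    """E[K_r^*] = Gamma(r) / (mu * r!) = 1 / (mu * r) for r >= 1."""
    return gamma(r) / (mu * factorial(r))


def empty_gap_integral(w, upper=80.0, steps=80_000):
    """Simpson's rule for int_0^upper e^{-(1-w)x} (1 - e^{-wx}) dx / x (steps must be even)."""
    def f(x):
        if x == 0.0:
            return w  # limit of the integrand as x -> 0
        return exp(-(1.0 - w) * x) * (1.0 - exp(-w * x)) / x

    h = upper / steps
    total = f(0.0) + f(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3.0
```

For instance, `empty_gap_integral(0.5)` should be close to $\log 2$; averaging $-\log(1-W)$ over the law of $W$ and dividing by $\mu$ then gives $\nu/\mu$.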

In the case $\nu = \infty$ the divergence of $K_0^*$ stems from the neighbourhood of $\infty$ and not of $0$, as one sees by checking that the mean number of $\widehat{\mathcal B}$-points in $[\min(Y, s), s]$ is finite for every $s > 0$.

Proposition 3.3. The conditions $\nu = \infty$ and $K_0^* = \infty$ a.s. are equivalent.

Proof. If $K_0^* = \infty$ a.s. then $\nu = \infty$ by Lemma 3.2. The proof in the other direction follows by application of the Kochen-Stone extension of the Borel-Cantelli lemma. □

Theorem 3.4. As $n \to \infty$ we have

$$(K_{n,0}, K_{n,1}, \ldots) \to_d (K_0^*, K_1^*, \ldots),$$

along with the convergence of expectations

$$\mathbb E[K_{n,r}] \to \mathbb E[K_r^*],$$

where the limit may be finite or infinite for $r = 0$.

Proof. The limit set satisfies $\widehat{\mathcal B} \cap [0, 1] \stackrel{d}{=} W_0 \mathcal B$ where $\mathcal B$ and $W_0$ are independent, and $W_0$ has the density $(\mu x)^{-1}\, \mathbb P(W < x)\, dx$ on $[0, 1]$. We shall speak of $[r]$-counts meaning the intervals within $[U_{n,1}, 1]$ (or $[Y, \infty]$, depending on the context) that contain at most $r$ sampling points, and we denote $K^*_{[r]} = \sum_{i=0}^{r} K_i^*$, $K_{n,[r]} = \sum_{i=0}^{r} K_{n,i}$. Replacing in the proof of Lemma 3.2 the lower limit of integration by $sW_0$, we see that by choosing $s$ large enough we can make the contribution to $\mathbb E[K^*_{[r]}]$ of the intervals with right endpoint in $[sW_0, \infty]$ arbitrarily small. It remains to show that the contribution to $\mathbb E[K_{n,[r]}]$ of $[s/n, 1]$ is small for large enough $s$, uniformly in $n$.

Observe that the number of components of $\mathcal B^c \cap [\epsilon, 1]$ that contain no more than $r$ uniform points is nonincreasing with $n$, because the number of 'balls' in a 'box' can only grow as more 'balls' are thrown. Furthermore, observe that, for the purpose of the estimate, the fixed-$n$ uniform sample can be replaced by the Poisson sample of rate $n$ on $[0, 1]$. Indeed, the probability that a gap of size $x$ is hit by exactly $r$ uniform points is $\binom{n}{r} x^r (1 - x)^{n-r}$, and in the poissonised model it is $e^{-nx}(nx)^r / r!$ for $r > 0$, while for $r = 0$ we have $(1 - x)^n$ versus $e^{-nx}$. In the range $0 < x < 1/2$ we have elementary estimates $c_1 e^{-x} < 1 - x < c_2 e^{-x}$ for suitable positive $c_1, c_2$, which allow one to show that the mean number of $[r]$-counts coming from $[s/n, 1]$ is of the same order for both models. The intervals of size larger than $1/2$ can be ignored, since the probability that they accommodate $r$ or fewer sample points decays exponentially with $n$.

Arguing within the framework of the Poisson sample on $[0, 1]$, we compare occupancy of 'boxes' generated by $\mathcal B$ with that for $W_0 \mathcal B$. The 'meander interval' $[W_0, 1]$ gives negligible contribution to $[r]$-counts, hence will be ignored. Because $\widehat{\mathcal B} \cap [W_0 s/n, W_0]$ is a zoomed-in copy of $\mathcal B \cap [s/n, 1]$, the sequence of occupancy counts for $\widehat{\mathcal B} \cap [W_0 s/n, W_0]$ has the same distribution as if we had $\mathcal B \cap [s/n, 1]$ in the role of 'boxes' and a mixed Poisson process with rate $nW_0$ in the role of 'balls'. By monotonicity and because $W_0 < 1$, the number of $[r]$-counts derived from $\widehat{\mathcal B} \cap [W_0 s/n, W_0]$ is larger than the number of $[r]$-counts from $\mathcal B \cap [s/n, W_0]$, therefore the mean number of such counts can be kept small by the choice of $s$. This implies the desired estimate of the contribution of $[s/n, W_0]$ to $\mathbb E[K_{n,[r]}]$. □
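The comparison between the fixed-$n$ model and its poissonised version, used in the proof above, can be made concrete. The sketch below evaluates both hit probabilities for a gap of size $x$ and checks one explicit form of the elementary estimate, namely $e^{-x - x^2} \le 1 - x \le e^{-x}$ on $[0, 1/2]$; this particular sandwich is an assumption supplied here for illustration (it follows from the series of $-\log(1-x)$), and the function names are hypothetical.

```python
from math import comb, exp, factorial


def binomial_hit(n, r, x):
    """Probability that a gap of size x receives exactly r of n uniform points."""
    return comb(n, r) * x ** r * (1.0 - x) ** (n - r)


def poisson_hit(n, r, x):
    """Same probability when the sample is a Poisson process of rate n."""
    return exp(-n * x) * (n * x) ** r / factorial(r)


def sandwich_holds(x):
    """Check e^{-x - x^2} <= 1 - x <= e^{-x}; expected to hold for 0 <= x <= 1/2."""
    return exp(-x - x * x) <= 1.0 - x <= exp(-x)
```

For gaps of size of order $1/n$ the two probabilities are close: with $n = 10^4$, $x = 3/n$ and $r = 2$ both are near the Poisson(3) probability of the value 2.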